Tech Arsenal 1

home *** CD-ROM | disk | FTP | other *** search

/ Tech Arsenal 1 / Tech Arsenal (Arsenal Computer).ISO / tek-01 / cuj1008.zip / RAMEY.ZIP / PSORT.NR < prev next >

Wrap

Text File | 1991-09-04 | 24KB | 560 lines

.sp 15 .ce 3 The Postman's Sort .sp 1 Robert Ramey .sp 3 .ce 1 abstract .sp 1 This article describes a program that sorts an arbitrarily large number of records in less time than any algorithm based on comparison sorting can. For many commonly encountered files, time will be strictly proportional to the number of records. It is not a toy program. It can sort on an arbitrary group of fields with arbitrary collating sequence on each field faster than any other program available. .bp .SH Introduction .PP Forget for a moment everything you know about computer science. .PP If you were given a stack of playing cards and asked to sort them by suit, how would you go about it. .PP Most people would deal the cards into four piles, one for each suit. After dealing the last card the piles would be picked up one after the other. Sorting a deck of cards this way into four suits would take about a minute. .PP Suppose that you were given 100 decks of cards into suits. Would you use a different method? How long would it take you to do it. It should be pretty clear that the same method would be used by most people and that it would take 100 times as long as for sorting one deck of cards into four suits. .PP So here we have a sorting method which will time strictly proportional to the number of items to be sorted. This has been known as distribution sorting. .PP Most computer professionals "know" that time to sort a file of records must increase disproportionately to the number of records in the file. Its a good thing that the rest of us don't know it. We might never get our mail. .PP In fact, most will cite Donald E. Knuth's book .UL The Art of Computer Programming:Searching and Sorting Knuth's book contains a chapter on minimum comparison sorting which shows that a sorting method which compares records can do no less than n lg2(n) (note: that's log base 2) comparisons. However, this analysis specifically excludes methods that to not use key comparison. In fact, other sections of the book allude to methods whose time is proportional to the number of records sorted. .PP Then, why does it seem that every sort utility uses quicksort? Practical sort utilities have to be able to handle a wide variety of keys and key types and be fast. The methods described in Knuth's book that to not use comparison have depended on the use of fixed length keys. Also, although they might be faster for a sufficiently large number of records, they are slow on the numbers of records usually encountered. For example, to sort a group of records on a social security number using radix list sort would require 30 passes through the file! .PP There is a way to overcome these deficiency. In fact the post office uses it every day. While researching this article (after writing the program), I found that Knuth alluded to it and left it as an exercise for the reader. .SH "A Generalized Distribution Sorting Algorithm" .PP When a postal clerk recieves a huge bag of letters he distributes them into other bags by state. Each bag gets sent to the indicated state. Upon arrival, another clerk distributes the letters in his bag into other bags by city. So the process continues until the bags are the size one man can carry and deliver. This is the basis for my sorting method which I call the postman's sort. .PP Suppose we are given a large list of records to be ordered alphabetically on a particular field. Make one pass through the file. Each record read is added to one of 26 lists depending on the first letter in the field. The first list contains all the records with fields starting with the letter "A" while the last contains all the records with fields starting with the letter "Z". Now we have divided the problem down to 26 smaller subproblems. Now we address subproblem of sorting all the records in the sublist corresponding to key fields starting with the letter "A". If there are no records in the "A" sublist we can proceed to deal with the "B" sublist. If the "A" sublist contains only one record it can be written to output and we are done with that sublist. If the "A" sublist contains more that one record, it must be sorted then output. Only when the "A" list has been disposed of we can move on to each of the other sublists in sequence. The records will be written to the output in alphabetical order. When the "A" list contains more than one record it has to be sorted before it is output. What sorting algorithm should be used? Just like a real postman, we use the postman's sort. Of course we just apply the method to the second letter of the field. This is done to greater and greater depths until eventually all the words starting with "A" are written to the output. We can then proceed to deal with sublists "B" through "Z" in the same manner. .PP The above algorithm is implemented in the accompanying program. The function sort() is passed a sublist. First it is determined how many records the sublist has. If its one we're done after writing the record. If the sublist is empty we can just return. Otherwise we advance the pointer into the sort key and continue. plan() and allocate() reserve memory on the stack of sublist pointers. dist() distributes the input sublist among the newly created sublists. dsort() calls sort() for each of the newly created sublists. .SH "Reality Check" .PP A practical sort utility must be able to sort on any fields in any sequence with any collating sequence. The command line syntax permits specification of delimited fields, start of key within each field, and specification of collating sequences for each field. See the accompanying manual page for the sort program. Sorting proceeds according to each key field. The first character within a field which corresponds to a zero collating sequence terminates the field. The rest of the field is not considered in the sort. Sorting continues on remaining key fields. This means that "A<0>C" might be written to output before "A<0>B" in the final output. If this behavior is not desired, one should make sure that every character that could appear within a field is assigned a non-zero collating sequence value. .PP To be really useful the program must take into account that records can have varying numbers of fields, fields can have varying number of characters and that the command line may specify fields and/or key characters that a particular record does not contain. In general keys corresponding to non-existant fields are distributed with a collating sequence of zero. This turns out to be more natural, convenient and flexible of the possibilities considered. This introduces some complications which require refinements of the algorithm. In .PP After distribution is completed by dist(). The newly created sublists are processed by dsort(). Sublists corresponding to a zero collating value are handled first. For these sublists, dsort() advances key pointers to the next key field before continuing to distribute these sublists. The remaining sublists are handled as previously described. .SH "Speeding things up" .PP Is this the best we can do? Suppose we wanted to sort 100 decks of shuffled playing cards by value and suit. We could first distribute by suit and then distribute each pile by value. But we probably wouldn't. What we would probably do is to arrange 4 rows and 13 columns. Each card would be dealt to the pile corresponding to its suit and value. All the cards could be sorted in one pass. .PP When we're sorting on the basis of a key field we can do the same thing by distributing among a larger number of sublists. In one pass we would have sublists "\0?", "A\0", "AA", "AB",...,"AZ", "B\0", "BA", ...,"ZY","ZZ". Of course information on 1 + 26 * (1 + 26) = 703 sublists must be maintained in memory, but the effect is to reduce the number of times the key has to be addressed by half. The basic principle is to sort on the basis of as many bytes at a time as possible. If we want to include all 8 bit bytes in the collating sequence we are going to need 64k sublists to sort on two bytes at a time. This usually not going to be too practical (at least on Intel type processors where the maximum size data segment is 64k). Fortunately, in the real world we usually know something about the fields we want to sort on. For example, to create a telephone directory, we sort on peoples names last name first. These fields will only contain alphabetic characters. Hence for this purpose we can take two bytes at a time. Social security numbers only contain decimal digits so we can take 3 or 4 at time before things get out of hand. .PP The function plan() called from sort() determines the best number of bytes to take given the collating sequences defined for the key fields. It also sets up pointers to be able to find the key fields. The program is designed so that we can take bytes from more than one field on one pass. This should assure us that the maximum number of bytes possible has been used to distribute the record each time it is addresses. To sort a file of records on a social security number in the third field in format 999-99-9999 one could use the command line asort -k '0'-'9' -f 2 -c 0 -c 4 -c 7 The default setup for asort will allocate sublist space to handle three decimal digits. The first time a record is addressed it will be distributed on the first three digits. The second time '-' will be encountered at the next key position so that all the records with the same first three digits will added to sublist 0. the third time distribution will be on the middle two digits. Enough space will be allocated to distribute on three digits but only a small portion will be used as each field will be terminated by the '-'. The fourth distribution will be on the next three digits and the fifth will be on the last digit and field terminator. The best command line to sort a file of records based on a social security number in the third field in format 999-99-9999 would be asort -k '0'-'9' -f 2 -c 0-2 -c 4-5 -c 7-10 . In this case sort would pass only three times through the records. In general, the more information we supply in the command line the faster the sort is likely to be executed. .SH "Storage Management" .PP Sublists are implemented as linked lists of records so that moving records around is fast and simple. As records are read from the standard input they are processed by the first distribution pass. If all the records don't fit in memory a temporary disk file containing the sublists is created. Compiling with the MicroSoft C v5.21 compact memory model(that's small code/large data), it turns out that for an average fill with random alphabetic keys, a given record will be written and read on/from the temporary file at most once if the file to be sorted is less than 280MB. Of course if we're using MSDOS we run in to other problems first. .PP As usual a disproportionate part of the program is taken up with issues of memory management. Even so, won't go into it here as it peripheral to the subject of this article. .SH "Comparing apples and oranges" .PP Below I offer an informal analysis of the speed of the postman's sort for some different types of files. Comparative analysis of the comparison sorting and the postman's sort is somewhat difficult since the methods are so different. I have chosen to consider "primitive operations". By primitive operation I mean each time a record is addressed and/or manipulated in some way. If the primitive operation has some sort of inner loop such as a string comparison does, I count each cycle of the inner loop as a primitive operation. .PP For comparison sorts this will be comparing two bytes and perhaps moving records. For the postman's sort this will be manipulation of a byte from the key field. Of course, this is going to be a rough approximation but I think it will still be worthwhile. .PP First consider the case of a relatively short fixed length key is used for the sort. eg. sorting accounting transactions by date in format mmdd. The number of days in a month or year is small enough so that we would expect to complete the job in one pass through the file distributing records among 2*10*4*10 = 800 sublists. This is determined by using a key of command line of -k '0'-`1' -c 0 -k '0'-'9' -c 1 -k '0'-'3' -c 3 -k '0'-'9' In this case the number of primitive operations required by the postman's sort will be equal to the number of records to be sorted times four operations per record as there are four bytes in the record. Note that this is an example of a sorting method which requires time proportional to the number of records in the file. And its not just a toy problem. .PP Using any comparison sorting method the number of comparisons will be at least N*lg2(N). Since the number of records far exceeds the number of keys, The number of primitive operations per comparison will be close to 4. Using 3 as en estimate the total number of primitive operations will be about 3*N*lg2(N). The postman's sort will be faster than the fastest comparison sort by the following factor: .nf 3*N*lg2(N) --------- = .75*lg2(N) 4*N .fi For 100,000 records the fastest comparison sort will require 12 times more primitive operations than the postman's sort. .PP Next consider the case of a relatively long variable length key. Suppose we use as an example keys made of random alphabetic characters. Using the postman's sort and taking two characters at a time, the first pass through the file will generate 703 subfiles. Each subfile will in turn be distributed among 703 subfiles, etc. until each subfile has only one record. Hence each record will be addressed approximately lg703(N) times where N is the number of records in the file. The total number of primitive operations will be 2*N*lg703(N). .PP Using a comparison sort, most of the comparisons are between keys that are close together. If there are less the 26 records in the file we would expect most comparisons to terminate on the first byte. If there are less than 26*26 records in the file most comparisons will terminate on the second byte and so on. This works out to lg26(N) primitive operations for each comparison. The number of comparisons is at least N*lg2(N). So the total primitive operation would be N*lg2(N)*lg26(N) .PP For this file the postman's sort will be faster than the comparison sort by the following factor. .nf N*lg2(N)*lg26(N) lg2(N)*ln(703) ---------------- = -------------- = .9*lg2(N) (2*N*lg703(N)) 2*ln(26) .fi or almost 16 times faster for a 100,000 record file. .SH "Just How Fast Is It" .PP The above analysis is rather crude in that it makes a number of assumptions to make the calculations easier. The above analysis assumes that all types of "primitive operations" take the same amount of time to complete. Also a practical program spends a lot of time in storage management overhead which the theoretic approach doesn't take in to account. Next let's look at some real results. In order to test this sorting program, I generated test files consisting of variable length records of random alphabetic characters. The average record length is 15 bytes long. The files were 1000, 10,000 and 100,000 records long. (15K, 150K, and 1.5MB long). For comparison purposes I used the sort program from the MKS Toolkit from Mortice Kern Systems. I have assumed that it uses some sort of comparison sort. The results are summarized in the following table .nf .ta .ta 35R,50R,63R Comp. Sort Postman's Sort Speed Factor Time fixed key 3*N*lg2(N) 4*N .75lg2(N) Time Variable key N*lg2(N)*lg26(N) 2*N*lg703(N) .9*lg2(N) Sort 1K records 3 sec 3 sec 1.0 Sort 10K records 27sec 18 sec 1.5 Sort 100K records 558sec 290 sec 1.92 .fi My machine is a 12 MegaHertz AT compatible with two 28ms Seagate 251-1 disk drives. One drive was used for temporary files. The number of drives used affected greatly the time for comparison sort of a large file. It had little effect on the time required for the postman's sort. .PP Well it looks like the postman's sort isn't as fast my theory predicts. Buts its still twice as fast at the next best thing. .SH "Final Observations" One of the most gratifying things about the postman's sort is that the first records are produced on the standard output before the file has even completed reading! In our test file, records which contained just a newline character are output as they are read in. When the last record is read in the first non null records appear almost immediately. If one is working with a multi-tasking system with pipes, much time will be saved as the subsequent process can proceed parallel with the sorting process. So even though this sort is only twice as fast, the jobs that use it may be improved much more. .PP Well, there you have it. A program which sorts a file of N records in less time than it takes to make N*lg2(N) comparisons. For a large and common class of sorting keys, it will sort a file in a time proportional the the number of records in the file. .bp NAME psort - sort the standard input file SYNOPSIS psort [-t <dir>] [-s <record size>] .br [ [-k <keys>] ] [ [-f <range>] [-c <range>]... ]... DESCRIPTION psort sorts lines of the standard input file and writes the result on the standard output. Default key is the entire line. Default ordering is lexicographic by bytes in machine collating sequence. .in +4 .ti -4 -t use the following name for the temporary directory. Default is taken from environment variable %TMP% .ti -4 -s record size. if a range (eg. 10-124) is specified, records will be of variable length delimited by a newline. If a record larger than the indicated maximum size is encountered the program will terminate with and error message. If a single value is specified records will be fixed in length and the newline will have no special significance. If no record size is specified, variable length records upto 511 bytes are assumed. .ti -4 -k specify the collating sequence for the subsequently specified sorting fields. The collating sequence is specified as one or more ranges of values. Characters are assigned collating sequence in order of their specification. For example, to sort lower case alphabetic characters use -k 'a'-'z'. Any characters not specified will be assigned a collating value of 0. Characters in a field beyond a character with a 0 collating value will not be included within the sorted field. Hence, either of records b<0>a and b<0>c may precede the other in the output file. Within a key specification, any number of ranges may be specified. For example, if the sorting field will contain any combination of lower case letters, digits and spaces, use -k ' ' '0'-'9' 'a'-'z'. Spaces will sort before digits which will sort before lower case letters. If no key collating sequence is specified, a default collating sequence of all printable ascii characters is used. .ti -4 -r repeat previous collating sequence. For example, to fold upper case letters to lower case letters for purposes of determining sorting priority use -k 'a'-'z' -r 'A'-'Z'. This would assign the first character following the -r the same collating value as the first one assigned in the previous range. To give varying white space characters equal weight use -k ' ' -r '\t' -r '_' . .ti -4 -n numeric sort on the key. This is an alternative to -k. numeric fields may contain a leading sign and/or decimal point. Numeric fields should look like [' ']...[+|-]['0'-'9']...[.]['0'-'9']... Any non-numeric characters will terminate the field. The field will be sorted by numeric value. .ti -4 -i invert the sequence of the sort for the last key specified. .ti -4 -u output only records that are unique according to the sorting key fields. .ti -4 -f sort on one or more fields. Fields are groups of characters separated by a delimiter character. Fields are number starting at 0. That is -f 0 refers to the start of the record. A field specification may contain a range of fields as in -f 2-4 to indicate that sorting sequence is to be determined on the basis of the third, fourth and fifth fields in turn. A range must have a definite end. ie -f 2- is not permitted. A field range need not be increasing. ie -f 3-2 is permited and will sort first by the fourth then by the third field. .ti -4 -c sort one or more characters within the indicated fields. Start counting character positions from 0. For example -f 1 -c 2-3 would sort on the third and fourth characters of the second field. Several character ranges may be specified for a given field. For example -f 2 -c 5-6 -c 3-4 -c 1-2 would specify three sorting fields of 2 characters each with the third delimited field. When specifying a character range within a field, the second number must be greater or larger than the first. ie. -c 7-3 cannot be used. An indefinite character range can be specified as in -c 4- . This will indicate all characters starting with the fifth to the end of the field. .ti -4 <range> a range is used to specify ranges of fields, displacements within a field and collating values. The common syntax is <start>[-[<end>]] . <start> indicates a single value. <start>- indicates a range beginning at <start> to a large number. For example -f 2- would be used to specify all fields after the second. <start>-<end> indicates a range of fields. The start and end number can be in a number of formats: simple decimal numbers, numbers starting with 0 are taken to be octal, numbers starting with 0x are taken to be hexidecimal and characters within apostrophes are converted to there ascii value. Hence -k ' ' 'a'-'z' and -k 0x20 'a'-122 are equivalent. .ti -4 -d the following character is the field delimiter. For example -d '|' . The default field delimiter is a tab (0x09). .in -4 If no sorting fields are specified, the whole record is taken as a sorting field. Sorting proceeds according to the precedence indicated by the sequence of the sorting fields. Records with the same sorting fields will be output in an unpredictable sequence. Remember that characters not specified within a collating sequence are taken as collating value zero. This can result in unexpected behavior when fields are not the same length. Following is the result of sorting a small file with -k 'z'-'a'. .nf def cad basdf a aa .fi This was probably not the result intended. To get the desired result, use -k 'a'-'z' -i. .nf def cad basdf aa a .fi The following switches normally need not be used. They are included for purposes of fine tuning and debugging. .in +4 .ti -4 -m maximum memory in K to be allocated for sort. This can be useful in a multitasking environment so that psort doesn't suck up all the memory available in the system. .ti -4 -b specify the size of the buffers used for standard input/output in K. The default size is 30. .ti -4 -sb specify the size of the output buffer used to write data to the temporary file. Current value is 30. .ti -4 -rb specify the size of the input buffer used to read data from the temporary file. Current value is 30. .ti -4 -v specify visible mode. This displays statistics on each distribution pass in the file. It is useful for debugging and fine tuning. If only the top levels of distribution are desired use -v <number of levels>. .ti -4 -l length of segment used by internal storage in K. Current value is 16.